Near-Optimal Reinforcement Learning in Dynamic Treatment Regimes
A dynamic treatment regime (DTR) consists of a sequence of decision rules, one per stage of intervention, that dictate how to assign treatment to a patient based on the evolving history of treatments and covariates. Such regimes are particularly effective for managing chronic disorders and are arguably a key step toward more personalized decision-making. In this paper, we investigate the online reinforcement learning (RL) problem of selecting optimal DTRs when observational data are available. We develop the first adaptive algorithm that achieves near-optimal regret for DTRs in online settings, without any access to historical data. We further derive informative bounds on the system dynamics of the underlying DTR from confounded observational data. Finally, we combine these results into a novel RL algorithm that efficiently learns the optimal DTR while leveraging abundant, yet imperfect, confounded observations.
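The staged structure described in the abstract is easy to sketch. Below is a minimal, hypothetical illustration (not the paper's algorithm) of rolling out a DTR: one decision rule per stage, each mapping the accumulated history of covariates and treatments to the next treatment. The names `run_dtr`, `treat_if_positive`, and `lowers_by_treatment` are illustrative, not from the paper.

```python
def run_dtr(rules, transition, x0, horizon):
    """Roll out a dynamic treatment regime: at stage t, rules[t] maps the
    full history of covariates and treatments observed so far to a treatment."""
    history = [("covariate", x0)]
    treatments = []
    for t in range(horizon):
        a = rules[t](history)          # decision rule for stage t
        treatments.append(a)
        history.append(("treatment", a))
        history.append(("covariate", transition(history)))  # next covariate
    return treatments

# Illustrative rule and dynamics: treat while the latest covariate is
# positive; each unit of treatment lowers the covariate by one.
def treat_if_positive(history):
    return 1 if history[-1][1] > 0 else 0

def lowers_by_treatment(history):
    x0 = history[0][1]
    return x0 - sum(v for kind, v in history if kind == "treatment")
```

For example, `run_dtr([treat_if_positive] * 3, lowers_by_treatment, 2, 3)` returns `[1, 1, 0]`: under these toy dynamics the patient is treated at the first two stages and not at the third.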
Near-Optimal Reinforcement Learning with Self-Play
This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with $S$ states, $A$ max-player actions, and $B$ min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires $\tilde{O}(S^2AB)$ steps of game playing, highlighting only the dependency on $(S, A, B)$. In contrast, the best existing lower bound scales as $\Omega(S(A+B))$, leaving a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{O}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\tilde{O}(S(A+B))$. The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. In addition, we present a computational hardness result for learning the best response against a fixed opponent in Markov games---a learning objective different from finding the Nash equilibrium.
- North America > United States > Oregon > Benton County > Corvallis (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- North America > Canada (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
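To illustrate the self-play idea in the abstract above (learning by playing against oneself, with no direct supervision), here is a minimal classical sketch: fictitious play on the matching-pennies matrix game. In zero-sum games, its empirical action frequencies converge to a Nash equilibrium, here $(1/2, 1/2)$ for both players. This is a toy illustration, not the paper's Nash Q-learning or Nash V-learning algorithms.

```python
# Matching pennies: payoffs for the max (row) player; the min (column)
# player receives the negation. The unique Nash equilibrium is the
# uniform mixture (1/2, 1/2) for both players.
A = [[1.0, -1.0], [-1.0, 1.0]]

def fictitious_play(rounds=5000):
    """Each player best-responds to the opponent's empirical action
    frequencies; no supervision beyond the game matrix is used."""
    row_counts = [1.0, 1.0]  # smoothed counts of past row actions
    col_counts = [1.0, 1.0]  # smoothed counts of past column actions
    for _ in range(rounds):
        row_freq = [c / sum(row_counts) for c in row_counts]
        col_freq = [c / sum(col_counts) for c in col_counts]
        # Row player maximizes expected payoff against col_freq.
        row_vals = [sum(A[i][j] * col_freq[j] for j in range(2)) for i in range(2)]
        i = max(range(2), key=lambda k: row_vals[k])
        # Column player minimizes expected payoff against row_freq.
        col_vals = [sum(row_freq[i2] * A[i2][j] for i2 in range(2)) for j in range(2)]
        j = min(range(2), key=lambda k: col_vals[k])
        row_counts[i] += 1.0
        col_counts[j] += 1.0
    return ([c / sum(row_counts) for c in row_counts],
            [c / sum(col_counts) for c in col_counts])
```

After a few thousand rounds, both players' empirical frequencies are close to the equilibrium mixture, even though every individual round is a pure best response.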
Near-optimal Reinforcement Learning in Factored MDPs
Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $\Omega(\sqrt{SAT})$ regret on some MDP, where $T$ is the elapsed time and $S$ and $A$ are the cardinalities of the state and action spaces. This implies $T = \Omega(SA)$ time to guarantee a near-optimal policy. In many settings of practical interest, due to the curse of dimensionality, $S$ and $A$ can be so enormous that this learning time is unacceptable. We establish that, if the system is known to be a \emph{factored} MDP, it is possible to achieve regret that scales polynomially in the number of \emph{parameters} encoding the factored MDP, which may be exponentially smaller than $S$ or $A$. We provide two algorithms that satisfy near-optimal regret bounds in this context: posterior sampling reinforcement learning (PSRL) and an upper confidence bound algorithm (UCRL-Factored).
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)
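The posterior-sampling idea behind PSRL is easiest to see in its bandit special case, Thompson sampling: sample a model from the posterior, act greedily with respect to the sample, then update the posterior with the outcome. The sketch below shows that special case with Beta posteriors over Bernoulli arms; it illustrates the principle only, not the paper's PSRL for factored MDPs.

```python
import random

def thompson_sampling(true_means, rounds=3000, seed=0):
    """Posterior sampling in the bandit special case of PSRL: sample a
    model (here, one mean per arm) from the posterior, act greedily with
    respect to the sample, then update the posterior with the reward."""
    rng = random.Random(seed)
    n = len(true_means)
    alpha = [1.0] * n  # Beta(1, 1) prior per Bernoulli arm
    beta = [1.0] * n
    pulls = [0] * n
    for _ in range(rounds):
        sampled = [rng.betavariate(alpha[a], beta[a]) for a in range(n)]
        a = max(range(n), key=lambda k: sampled[k])  # greedy in the sampled model
        reward = 1 if rng.random() < true_means[a] else 0
        alpha[a] += reward
        beta[a] += 1 - reward
        pulls[a] += 1
    return pulls
```

With arms of mean 0.2 and 0.8, the pulls concentrate on the better arm. PSRL applies the same recipe with a posterior over entire MDPs and a planning step in place of the greedy arm choice; the factored structure studied in the paper lets that posterior be parameterized far more compactly than by $S$ and $A$ alone.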
Reviews: Near-Optimal Reinforcement Learning in Dynamic Treatment Regimes
Update after rebuttal: Due to the author comments and, in particular, discussions with the other reviewers, I have updated my score from a 4 to a weak accept (6). For the future draft, aside from the revisions and clarifications the authors have promised in the rebuttal, I recommend the following slight modifications to improve the manuscript: The motivation in the introduction would be strengthened by drawing clearer connections to the real world. The authors should consider picking a specific real-world example and illustrating the method through it (even if it is not possible to provide simulation results on such an example). In line with this, the authors should be careful in their discussion of safe RL. Such methods typically use constraints to ensure safety, but the authors do not appear to explicitly use or discuss such methods here.
Reviews: Near-Optimal Reinforcement Learning in Dynamic Treatment Regimes
In this paper, the authors provide a method for incorporating observational data (possibly subject to unobserved confounding) to improve the performance of policy learning in online settings (the crucial results are Theorems 5, 7, and 8). After a period of discussion, the reviewers came to a consensus that this paper merits publication at NeurIPS and will contribute to the RL literature by giving a principled method of incorporating observational data, even if confounded.
Review for NeurIPS paper: Near-Optimal Reinforcement Learning with Self-Play
Additional Feedback: *) Is there a reason to include Algorithm 1? Algorithm 2 seems to give improved performance relative to it; if so, why present both algorithms rather than just Algorithm 2? *) Although Equation 9 can be thought of as a set of $n \cdot m$ linear constraints, why is the optimization problem always feasible? Although the authors devote half a page to explaining this procedure, I feel it is not well explained: most of the discussion is not devoted to the policy certification procedure itself. Why, for a fixed $\mu$, is the best response not Markovian?
Review for NeurIPS paper: Near-Optimal Reinforcement Learning with Self-Play
After reading the reviews and authors' responses, it seems the only main concern raised is the lack of experiments. My opinion is that while experiments would be nice to have, the lack of experiments is not a significant concern if the theoretical results are strong enough. In my own assessment of the paper, I find the theoretical results to be indeed quite a strong contribution to the field (they provide the first algorithm to match the PAC lower bound, for a problem which has quite a few previous works). The reviewers seem to agree with this point in their reviews. I, therefore, recommend that the paper be accepted.